78
6.1
Fast Search: BLAST as an Example for a Heuristic Search
The following accelerations are used for sequence comparisons (technical term: sequence
alignment):
A so-called indexing first considers whether the database entry contains single short
words (3 letters for protein sequences or 11 nucleotides for nucleic acid sequences) that
are similar to the sequence of the molecule. If this is the case (a first “hit” or “hit” is
found), the system immediately searches whether there is another hit not too far away
(predefined window length). Only when this second hit is found does the BLAST algo
rithm start checking whether the remaining sequence letters of this database entry match
the search sequence. This exact comparison of the two letter sequences (“alignment”) is
also accelerated by “dynamic programming”, so that step by step more memory is avail
able for the comparison of search sequence and database entry.
Thus, we see two principles of bioinformatics: Since all important biomolecules (DNA,
RNA, proteins, but also, for example, carbohydrates and lipids) are built from recurring
building units, most biomolecules can be recognized by the sequence of these building
units, i.e. by their letter sequence (with each class of molecules using its own alphabet).
In the meantime, however, so much information about biomolecules has been stored in
large databases that a major part of the informatics work in bioinformatics consists of
using fast computational rules (algorithms) and conveniently constructed databases to
cope with this flood of information so well that the correct biomolecule can be identified
as quickly as possible (see Mount et al., 2004).
If you use BLAST on the NCBI website (https://blast.ncbi.nlm.nih.gov/Blast.cgi), for
example, I get a result very quickly (a few seconds, sometimes one to 2 minutes). In that
time, BLAST actually screens several billion nucleotides and many millions of sequence
entries. This is an amazing speed-up. We now want to understand how to speed up bioin
formatics searches in general so that you get a result quickly. This usually happens by
foregoing the perfect search and taking a program that uses shortcuts to get a near-perfect
solution.
When searching for a similar sequence, one way to do an exact search would be to
compare letter by letter to determine exactly where a local match with high similarity is.
Local similarity is therefore a popular choice for protein function searches, because you
can then move on from a subsequence whose similarity was found in the database to the
next best similarity. After I have recognized that a partial sequence, usually a protein
domain, has a certain function, I shorten my protein by this domain and now search for a
hit in the database with the remaining sequence, which then not so rarely assigns the next
piece of the sequence, often a whole domain again, with a suspected function, and so on.
6.1
2013
6 Extremely Fast Sequence Comparisons Identify All the Molecules That Are Present…